{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5a7cc773",
   "metadata": {},
   "source": [
    "# Recursive URL\n",
    "\n",
    "The `RecursiveUrlLoader` lets you recursively scrape all child links from a root URL and parse them into Documents."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "947d29e7-3679-483d-973f-79ea3403a370",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "The `RecursiveUrlLoader` lives in the `langchain-community` package. There's no other required packages, though you will get richer default Document metadata if you have ``beautifulsoup4` installed as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23359ab0-8056-4dee-8bff-c38dc079f17f",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -qU langchain-community beautifulsoup4"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07985766-e4e9-4ea1-8a18-924fa4f294e5",
   "metadata": {},
   "source": [
    "## Instantiation\n",
    "\n",
    "Now we can instantiate our document loader object and load Documents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "cb208dcf-9ce9-4197-bc44-b80d20aa4e50",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import RecursiveUrlLoader\n",
    "\n",
    "loader = RecursiveUrlLoader(\n",
    "    \"https://docs.python.org/3.9/\",\n",
    "    # max_depth=2,\n",
    "    # use_async=False,\n",
    "    # extractor=None,\n",
    "    # metadata_extractor=None,\n",
    "    # exclude_dirs=(),\n",
    "    # timeout=10,\n",
    "    # check_response_status=True,\n",
    "    # continue_on_failure=True,\n",
    "    # prevent_outside=True,\n",
    "    # base_url=None,\n",
    "    # ...\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fac4425-735f-487d-a12b-c8ed2a209039",
   "metadata": {},
   "source": [
    "## Load\n",
    "\n",
    "Use ``.load()`` to synchronously load into memory all Documents, with one\n",
    "Document per visited URL. Starting from the initial URL, we recurse through\n",
    "all linked URLs up to the specified max_depth.\n",
    "\n",
    "Let's run through a basic example of how to use the `RecursiveUrlLoader` on the [Python 3.9 Documentation](https://docs.python.org/3.9/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a30843c8-4a59-43dc-bf60-f26532f0f8e1",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/bagatur/.pyenv/versions/3.9.1/lib/python3.9/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features=\"xml\"` into the BeautifulSoup constructor.\n",
      "  k = self.parse_starttag(i)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'source': 'https://docs.python.org/3.9/',\n",
       " 'content_type': 'text/html',\n",
       " 'title': '3.9.19 Documentation',\n",
       " 'language': None}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs = loader.load()\n",
    "docs[0].metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "211856ed-6dd7-46c6-859e-11aaea9093db",
   "metadata": {},
   "source": [
    "Great! The first document looks like the root page we started from. Let's look at the metadata of the next document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "2d842c03-fab8-4097-9f4f-809b2e71c0ba",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'source': 'https://docs.python.org/3.9/using/index.html',\n",
       " 'content_type': 'text/html',\n",
       " 'title': 'Python Setup and Usage — Python 3.9.19 documentation',\n",
       " 'language': None}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[1].metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5714ace-7cc5-4c5c-9426-f68342880da0",
   "metadata": {},
   "source": [
    "That url looks like a child of our root page, which is great! Let's move on from metadata to examine the content of one of our documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "51dc6c67-6857-4298-9472-08b147f3a631",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "<!DOCTYPE html>\n",
      "\n",
      "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n",
      "  <head>\n",
      "    <meta charset=\"utf-8\" /><title>3.9.19 Documentation</title><meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n",
      "    \n",
      "    <link rel=\"stylesheet\" href=\"_static/pydoctheme.css\" type=\"text/css\" />\n",
      "    <link rel=\n"
     ]
    }
   ],
   "source": [
    "print(docs[0].page_content[:300])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d87cc239",
   "metadata": {},
   "source": [
    "That certainly looks like HTML that comes from the url https://docs.python.org/3.9/, which is what we expected. Let's now look at some variations we can make to our basic example that can be helpful in different situations. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f41cc89",
   "metadata": {},
   "source": [
    "## Adding an Extractor\n",
    "\n",
    "By default the loader sets the raw HTML from each link as the Document page content. To parse this HTML into a more human/LLM-friendly format you can pass in a custom ``extractor`` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "33a6f5b8",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/td/vzm913rx77x21csd90g63_7c0000gn/T/ipykernel_10935/1083427287.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features=\"xml\"` into the BeautifulSoup constructor.\n",
      "  soup = BeautifulSoup(html, \"lxml\")\n",
      "/Users/isaachershenson/.pyenv/versions/3.11.9/lib/python3.11/html/parser.py:170: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features=\"xml\"` into the BeautifulSoup constructor.\n",
      "  k = self.parse_starttag(i)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.9.19 Documentation\n",
      "\n",
      "Download\n",
      "Download these documents\n",
      "Docs by version\n",
      "\n",
      "Python 3.13 (in development)\n",
      "Python 3.12 (stable)\n",
      "Python 3.11 (security-fixes)\n",
      "Python 3.10 (security-fixes)\n",
      "Python 3.9 (securit\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "\n",
    "def bs4_extractor(html: str) -> str:\n",
    "    soup = BeautifulSoup(html, \"lxml\")\n",
    "    return re.sub(r\"\\n\\n+\", \"\\n\\n\", soup.text).strip()\n",
    "\n",
    "\n",
    "loader = RecursiveUrlLoader(\"https://docs.python.org/3.9/\", extractor=bs4_extractor)\n",
    "docs = loader.load()\n",
    "print(docs[0].page_content[:200])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8e8a826",
   "metadata": {},
   "source": [
    "This looks much nicer!\n",
    "\n",
    "You can similarly pass in a `metadata_extractor` to customize how Document metadata is extracted from the HTTP response. See the [API reference](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html) for more on this."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1dddbc94",
   "metadata": {},
   "source": [
    "## Lazy loading\n",
    "\n",
    "If we're loading a  large number of Documents and our downstream operations can be done over subsets of all loaded Documents, we can lazily load our Documents one at a time to minimize our memory footprint:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "7d0114fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/4j/2rz3865x6qg07tx43146py8h0000gn/T/ipykernel_73962/2110507528.py:6: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features=\"xml\"` into the BeautifulSoup constructor.\n",
      "  soup = BeautifulSoup(html, \"lxml\")\n"
     ]
    }
   ],
   "source": [
    "page = []\n",
    "for doc in loader.lazy_load():\n",
    "    page.append(doc)\n",
    "    if len(page) >= 10:\n",
    "        # do some paged operation, e.g.\n",
    "        # index.upsert(page)\n",
    "\n",
    "        page = []"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f88a7c2f-35df-4c3a-b238-f91be2674b96",
   "metadata": {},
   "source": [
    "In this example we never have more than 10 Documents loaded into memory at a time."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e4d1c8f",
   "metadata": {},
   "source": [
    "## API reference\n",
    "\n",
    "These examples show just a few of the ways in which you can modify the default `RecursiveUrlLoader`, but there are many more modifications that can be made to best fit your use case. Using the parameters `link_regex` and `exclude_dirs` can help you filter out unwanted URLs, `aload()` and `alazy_load()` can be used for aynchronous loading, and more.\n",
    "\n",
    "For detailed information on configuring and calling the ``RecursiveUrlLoader``, please see the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
